popcompr is an R package to make it easier to compare different high resolution population datasets for humanitarian and research purposes. It is under active development and has not been released, but the code is available through a GPL3 license.

An example comparing 2019 population estimates for Lesotho from World Pop & Facebook/CIESIN

Included in the package are files to reproduce a minimal working example for the country of Lesotho. lesotho_wp_2019 and lesotho_fb_2019 are included in the package (use ?lesotho_wp_2019 for more details).

First load the necessary libraries

  • some extras for plotting (plotly & ggplot2)
library(popcompr)
#> Warning: replacing previous import 'data.table::shift' by 'raster::shift' when
#> loading 'popcompr'
library(raster)
#> Loading required package: sp
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following object is masked from 'package:raster':
#> 
#>     shift
library(foreach)
library(plotly)
#> Loading required package: ggplot2
#> 
#> Attaching package: 'plotly'
#> The following object is masked from 'package:ggplot2':
#> 
#>     last_plot
#> The following object is masked from 'package:raster':
#> 
#>     select
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following object is masked from 'package:graphics':
#> 
#>     layout
library(ggplot2)
library(fasterize)
#> 
#> Attaching package: 'fasterize'
#> The following object is masked from 'package:graphics':
#> 
#>     plot
#> The following object is masked from 'package:base':
#> 
#>     plot
library(sf)
#> Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1

Comparing populations at pixel level

You can first get an estimate of the time required to return the comparison:

# comparing at pixel level with data included in the package
lesotho_fb_2019 <- raster(system.file("external/lso_facebook_2019.tif", package="popcompr"))
lesotho_wp_2019 <- raster(system.file("external/lso_worldpop_2019.tif", package="popcompr"))

pop_list <- list(lesotho_wp_2019, lesotho_fb_2019)
compare_pop(pop_list, parallel = FALSE, 
                   estimate_time = TRUE)
#> It will take approximately 18.41 seconds to complete the full job serially.

It shouldn’t take too long by that estimate, but here we’ll work through parallelizing. Here, I use the doParallel backend, but other do packages can be used (anything compatible with the %dopar% infix in the foreach package).

# parallelized example
library(doParallel)
#> Loading required package: iterators
#> Loading required package: parallel
cl <- makeCluster(detectCores() - 1) # how many cores do we have available
registerDoParallel(cl)
system.time(
  # defaults to estimate_time = FALSE & resolution of ~ 1km at equator
  exe <- compare_pop(pop_list, parallel = TRUE) 
)
#> [1] "warning: 6.08053179085255 people were not matched."
#>    user  system elapsed 
#>   0.536   0.026  13.522
stopCluster(cl)

compare_pop will warn you if any people were not resampled to the comparison grid (this happens sometimes when pops are at the edge of the original raster).

The output is a raster brick with a layer corresponding to each of the input population rasters. We can vizualize and compare the rasters:

Map of differences
comp_dt <- data.table(as.data.frame(exe, xy = TRUE)) # we use data.table as it's easier to 

# transform functions for vizualizing difference
transform <- function(x) {
  logged <- log(abs(x) + 1e-6) * sign(x)
  return(logged)
}
inv_trans <- function(x) {
  inv <- (exp(abs(x)) - 1e-6) * sign(x)
  return(round(inv, 2))
}

map_compare <-
  ggplot(comp_dt) +
  geom_raster(aes(x = x, y = y, fill = transform(lso_worldpop_2019 - lso_facebook_2019))) +
  scale_fill_gradient2(labels = inv_trans, 
                       name = "Difference between \n population estimates") +
  coord_quickmap()
map_compare


hex_compare <-
  ggplot(comp_dt) +
  geom_hex(aes(x = lso_worldpop_2019, y = lso_facebook_2019), color = "grey") +
  geom_abline(intercept = 0, slope = 1, linetype = 2, color = "grey") +
  scale_fill_distiller(direction = 1, trans = "log",
                       labels = function(x) round(x, -1))
hex_compare
#> Warning: Removed 56799 rows containing non-finite values (stat_binhex).

You can also make these interactive using plotly/ggplotly functions:

ggplotly(map_compare) 
ggplotly(hex_compare) 
#> Warning: Removed 56799 rows containing non-finite values (stat_binhex).

Check the distribution of differences:

ggplot(comp_dt) +
  geom_histogram(aes(x =lso_worldpop_2019 - lso_facebook_2019))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 56799 rows containing non-finite values (stat_bin).

Comparing at the administrative level

You can access country shapefiles using get_country_shape which will return a shapefile as an sf object. These are available through the geoBoundaries API, where more documentation can be found on available datasets. In addition, iso_codes is a dataset provided in the package to find country codes and available admin levels.

# Find the right iso code & see which admin levels are available
dplyr::filter(iso_codes, grepl("Les", country))
#>   iso_code   year admin_level                     source
#> 1      LSO 2017.0        ADM0              OpenStreetMap
#> 2      LSO 2017.0        ADM1              OpenStreetMap
#> 3      LSO 2020.0        ADM2 Humanitarian Data Exchange
#> 4      TLS 2017.0        ADM0              OpenStreetMap
#> 5      TLS 2017.0        ADM1              OpenStreetMap
#> 6      TLS 2018.0        ADM2                        HDX
#>                                         license     country
#> 1   Open Data Commons Open Database License 1.0     Lesotho
#> 2   Open Data Commons Open Database License 1.0     Lesotho
#> 3                                                   Lesotho
#> 4   Open Data Commons Open Database License 1.0 Timor-Leste
#> 5   Open Data Commons Open Database License 1.0 Timor-Leste
#> 6 Humanitarian use only - Noted in metadata tab Timor-Leste

We get the Lesotho shapefile to the admin level 2 (the type argument defaults to a simplified shapefile which is faster for plotting and downloading, but users may prefer unsimplified files for better accuracy see the geoBoundaries API documentation for more information).

les_shape <- get_country_shape(country_iso = "LSO", admin_level = 2)
#> Reading layer `geoBoundariesSSCU-3_0_0-LSO-ADM2' from data source `https://geoboundaries.org/data/geoBoundariesSSCU-3_0_0/LSO/ADM2/geoBoundariesSSCU-3_0_0-LSO-ADM2.geojson' using driver `GeoJSON'
#> Simple feature collection with 78 features and 5 fields
#> geometry type:  POLYGON
#> dimension:      XY
#> bbox:           xmin: 27.01125 ymin: -30.67785 xmax: 29.45737 ymax: -28.57103
#> geographic CRS: WGS 84

Then we can aggregate the population rasters to it:

# Get the shapefile at admin 2
les_shape <- aggregate_to_shp(brick = exe, sf = les_shape, max_adjacent = 100)
#> Warning in gmin(match, na.rm = TRUE): No non-missing values found in at least
#> one group. Returning 'Inf' for such groups to be consistent with base

#> Warning in gmin(match, na.rm = TRUE): No non-missing values found in at least
#> one group. Returning 'Inf' for such groups to be consistent with base

#> Warning in gmin(match, na.rm = TRUE): No non-missing values found in at least
#> one group. Returning 'Inf' for such groups to be consistent with base

#> Warning in gmin(match, na.rm = TRUE): No non-missing values found in at least
#> one group. Returning 'Inf' for such groups to be consistent with base

#> Warning in gmin(match, na.rm = TRUE): No non-missing values found in at least
#> one group. Returning 'Inf' for such groups to be consistent with base

#> Warning in gmin(match, na.rm = TRUE): No non-missing values found in at least
#> one group. Returning 'Inf' for such groups to be consistent with base

aggregate_to_shp will warn you if grid cell values go unallocated to the shapefile. In the case that many people go missing, you can set max_adjacent higher (this determines how many grid cells to buffer to when looking for the nearest non-NA neighbor), but it may also be wise to check the extent & boundaries of the shapefile vs. the population rasters.

We can vizualize these differences as well:

map_compare <-
  ggplot(les_shape) +
  geom_sf(aes(fill = transform(lso_worldpop_2019 - lso_facebook_2019))) +
  scale_fill_gradient2(labels = inv_trans, 
                       name = "Difference between \n population estimates")
map_compare


hex_compare <-
  ggplot(les_shape) +
  geom_hex(aes(x = lso_worldpop_2019, y = lso_facebook_2019), color = "grey") +
  geom_abline(intercept = 0, slope = 1, linetype = 2, color = "grey") +
  scale_fill_distiller(direction = 1, trans = "log",
                       labels = function(x) round(x, -1))
hex_compare

You can also make these interactive using plotly/ggplotly functions:

ggplotly(map_compare) 
ggplotly(hex_compare) 

Check the distribution of differences:

ggplot(les_shape) +
  geom_histogram(aes(x =lso_worldpop_2019 - lso_facebook_2019))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

User specific inputs

You can use your own raster and shapefile inputs, they just need to be in the WGS84 (lat/long) coordinate system.